Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background content, we need not only to recognize their action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments with pre-determined boundaries. However, a desirable model should move beyond segment-level predictions and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network efficiently in an end-to-end manner. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates very high efficiency, processing 500 frames per second on a single GPU server. We will update the camera-ready version and publish the source code online soon.
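To make the key operation concrete, here is a minimal NumPy sketch of a filter that upsamples in time and downsamples in space in one joint step, as the abstract describes. This is an illustrative reading, not the paper's implementation: the function name `cdc_filter`, the tensor shapes, and the factor-2 temporal upsampling with a full H×W spatial collapse are assumptions for the sketch.

```python
import numpy as np

def cdc_filter(x, w, b):
    """Hypothetical minimal CDC layer.

    x: input feature map, shape (C_in, L, H, W)
    w: weights, shape (2, C_out, C_in, H, W) -- two independent spatial
       filters per output channel, one for each of the two output frames
       produced from every input time step (temporal upsampling x2 and
       spatial downsampling H x W -> 1 x 1, applied jointly)
    b: bias, shape (C_out,)
    returns: frame-level features, shape (C_out, 2 * L)
    """
    C_in, L, H, W = x.shape
    C_out = w.shape[1]
    out = np.empty((C_out, 2 * L))
    for t in range(L):
        for k in range(2):  # two output frames per input time step
            # collapse the spatial extent while placing the result at
            # upsampled temporal position 2*t + k
            out[:, 2 * t + k] = np.tensordot(
                w[k], x[:, t], axes=([1, 2, 3], [0, 1, 2])) + b
    return out
```

Because each of the two temporal outputs has its own spatial filter, this joint parameterization is more expressive than first pooling space to 1×1 and then applying a shared temporal deconvolution.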